GUIDE FOR COMPREHENSIVE DATA EXPLORATION WITH PYTHON

Vikum Wijesinghe - September 2019

Other Kernels: https://www.kaggle.com/vikumsw/kernels


Data exploration? Where to start? How to start? What to use? How to choose?

Assume you are a beginner with limited knowledge of data exploration. You soon discover that there is a huge collection of tools out there for data exploration: Matplotlib, Seaborn, ggplot, Bokeh, Plotly, Pygal, Altair, Geoplotlib, Gleam, Missingno, and more. Which one should you use? How should you use it? This broad availability itself creates confusion. The objective of this kernel is to clear that confusion and make you competent in the beautiful world of EDA (Exploratory Data Analysis).

One thing to remember... It does not matter which tool we use or how we use it, as long as we get what we are looking for: insights from the data! So let's forget about the tools and focus on getting the JOB DONE!

Two friends are going to join us for today's lesson. Let's wait for them; they should be here any minute.

Here they are! Let's welcome SpiderMan and IronMan. Today we are joining a class where IronMan teaches data exploration to SpiderMan. Now let's be quiet and listen to them.

Iron Man : Hi Spidy, Tell me what you wanna learn?

SpiderMan : I was studying data exploration with Python, but a few minutes into it I was feeling confused. I know that there is a huge collection of tools out there for data exploration: Matplotlib, Seaborn, ggplot, Bokeh, Plotly, Pygal, Altair, Geoplotlib, Gleam, Missingno, and more. But which one should I use? How should I use it? This broad availability itself has created confusion.

Iron Man : Ok, I get it. So what do you expect from this session?

SpiderMan : What are the things we do in EDA? I mean, is there a sequence of tasks? What tools to use? How to use them?... and...

Iron Man : Okay, I get it. It's best to explain the process while practicing. For demonstration let's use the two best-known datasets on Kaggle: the housing dataset and the Titanic dataset. Are you ready, Spidy?

SpiderMan : More than ever, boss!

Iron Man : Let's get the party started!

Iron Man : We need party people here... let's invite them.

In [1]:
#Inviting Party People
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
print(os.listdir("../input"))

#Load datasets for demonstrations
titanic_data = pd.read_csv("../input/titanic/train.csv")
house_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
['house-prices-advanced-regression-techniques', 'titanic']

Iron Man : First and foremost, it is important to have a look at the data to get a clear sense of what we are working on.

Peek Data, Setting the Context ^

Let's get an idea about the data using the following pandas functions.

  • DataFrame.head()
  • DataFrame.tail()
  • DataFrame.shape
  • DataFrame.columns
In [2]:
#look at first 5 rows using .head()
house_data.head()

#Wanna see more?. try -> house_data.head(13) for first 13 rows.
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [3]:
#look at last 5 rows using .tail()
house_data.tail()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

5 rows × 81 columns

In [4]:
house_data.shape
Out[4]:
(1460, 81)
In [5]:
house_data.columns
Out[5]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

SpiderMan : I see. Our dataset has 1460 rows and 81 columns. That is how to start, right? But that's a lot of data. Looks a bit confusing...

Iron Man : That's true. So let's focus on a single feature and study it. It's called Univariate Analysis. I'll read that part for you. Listen carefully...


Univariate Analysis ^

Analysis of a numerical feature ^

Okay, time to look at the problem of analysing a numerical feature. Say you have been given a numerical feature named 'SalePrice' and you are expected to explore it. What are the things you would do? Feeling confused or incompetent? Do not worry, it's very simple when you get it right.

What is numerical data? First, let's be clear about what numerical data is: numerical data have meaning as a measurement, such as a person's height, weight, IQ, or blood pressure; or they're a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages of your favorite book you can read before you fall asleep. Statisticians also call numerical data quantitative data.

Two Types of Numerical Data : Discrete Data & Continuous Data

What are we looking for? : What things would help us develop an understanding of the numerical feature? There is a basic set of stats & figures we use, as presented below.

  • Number of observations
  • Mean
  • Standard Deviation
  • Max, Min
  • Interquartile range (IQR)
  • Histograms
  • Skewness and Kurtosis
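
Every item on that list is only a pandas call or two away. Here is a minimal sketch on a small made-up series (not the real 'SalePrice' column), just to show where each number comes from:

```python
import pandas as pd

# A tiny made-up numerical series, for illustration only
s = pd.Series([34900, 129975, 163000, 180921, 214000, 250000, 755000],
              name='price')

summary = {
    'count': s.count(),
    'mean':  s.mean(),
    'std':   s.std(),
    'min':   s.min(),
    'max':   s.max(),
    'IQR':   s.quantile(0.75) - s.quantile(0.25),  # interquartile range = Q3 - Q1
    'skew':  s.skew(),
    'kurt':  s.kurt(),
}
for name, value in summary.items():
    print(f'{name:>5}: {value:,.2f}')
```

Histograms are covered by the distribution plot further down, so they are left out of this sketch.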

How to? Life is much easier when what you want is only a few lines of code away. There are a few basic things that we can do,

  1. Peek at data : everybody loves to look at what they work on.. :D
  2. Descriptive statistics summary
  3. Distribution plot
  4. Skewness and Kurtosis

Let's use the 'SalePrice' column of the housing dataset for demonstration. This feature contains a list of sale prices for houses. That's all we know... for now.. :)

In [6]:
#Peek... head or tail
house_data['SalePrice'].head()
Out[6]:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

SpiderMan : Ah interesting!. A numerical feature with big values.

In [7]:
# Descriptive statistics summary
house_data['SalePrice'].describe()
Out[7]:
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

SpiderMan : That's some valuable info we got from only one line of code! Let me put it into words... We got 1460 values with a mean of approximately 180921 and a standard deviation of 79442.5. The minimum value observed is 34900, while the maximum is 755000.

In [8]:
from scipy.stats import norm
# Distribution plot
def distribution_plot(data):
    sns.distplot(data, fit=norm)
    plt.ylabel('Frequency')
    plt.title(f'{data.name} distribution')
    
distribution_plot(house_data['SalePrice'])
In [9]:
#skewness and kurtosis
print("Skewness: %f" % house_data['SalePrice'].skew())
print("Kurtosis: %f" % house_data['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282

SpiderMan : The distribution plot looks interesting. Even though the values spread from 34900 to 755000, most of them lie between 100000 and 200000.
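
A side note on those numbers: a skewness near 1.9 means a long right tail. A common follow-up (not part of this kernel's flow, just a pointer) is a log transform, which usually pulls such a distribution much closer to normal. A sketch on made-up right-skewed data:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed "prices": a lognormal sample stands in for SalePrice
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))

print('skew before:', round(prices.skew(), 2))

# log1p(x) = log(1 + x), safe even when zeros are present
logged = np.log1p(prices)
print('skew after :', round(logged.skew(), 2))
```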

Iron Man : That is a univariate study of a numerical feature. Now let's see what to do when we get a categorical variable. Listen carefully...

Analysis of a categorical feature ^

Next, let me introduce you to the problem of analysing a categorical feature. Say you have been given a categorical feature named 'OverallQual' (overall material and finish quality) and you are expected to explore it. What are the things you would do? Feeling confused or incompetent? Do not worry, it's very simple when you get it right.

What is categorical data? First, let's be clear about what categorical data is: categorical data represent characteristics such as a person's gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as "1" indicating male and "2" indicating female), but those numbers don't have mathematical meaning; you couldn't add them together, for example. Statisticians also call categorical data qualitative data.

Two Types of Categorical Data : Nominal Data & Ordinal Data
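
Pandas can encode that distinction directly: a plain `category` dtype is nominal, while an explicit category order with `ordered=True` makes it ordinal, so sorting and comparisons become meaningful. A small sketch with made-up values (the Po/Fa/TA/Gd/Ex scale mimics the quality codes used in the housing dataset):

```python
import pandas as pd

# Nominal: categories with no inherent order
color = pd.Series(['red', 'blue', 'red'], dtype='category')

# Ordinal: an explicit order from poor to excellent
quality = pd.Series(pd.Categorical(
    ['Fa', 'Ex', 'TA', 'Gd'],
    categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    ordered=True))

print(quality.sort_values().tolist())  # sorted by quality, not alphabetically
print((quality > 'TA').tolist())       # comparisons work on ordered categories
```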

What are we looking for? : What things would help us develop an understanding of a categorical feature? There is a basic set of stats & figures we use, as presented below.

  • Number of observations
  • Cardinality : How many categories are there?
  • What are the different categories?
  • What is the most common category?
  • Value counts for each category.
  • Value percentages for each category.
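
Each bullet above maps onto a one-line pandas call. A sketch on a tiny made-up categorical series:

```python
import pandas as pd

# Made-up categorical feature, for illustration only
s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'], name='grade', dtype='category')

print('observations:', s.count())
print('cardinality :', s.nunique())
print('categories  :', list(s.cat.categories))
print('most common :', s.mode()[0])
print(s.value_counts())                      # count per category
print(s.value_counts(normalize=True) * 100)  # percentage per category
```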

How to? Life is much easier when what you want is only a few lines of code away. There are a few basic things that we can do,

  • Peek at data : everybody loves to look at what they work on.. :D
  • Descriptive statistics summary
  • Value count plots.
  • Pie chart for percentages of each category.

Let's use the 'OverallQual' column of the housing dataset for demonstration. This feature contains the overall material and finish quality for houses. That's all we know... and need to know for now.. :)

In [10]:
OverallQual = house_data['OverallQual'].astype('category')

#Peek... head or tail
OverallQual.head()
Out[10]:
0    7
1    6
2    7
3    7
4    8
Name: OverallQual, dtype: category
Categories (10, int64): [1, 2, 3, 4, ..., 7, 8, 9, 10]

SpiderMan : Ah, that is some useful info... the OverallQual feature has 10 categories.

In [11]:
# Descriptive statistics summary
OverallQual.describe()
Out[11]:
count     1460
unique      10
top          5
freq       397
Name: OverallQual, dtype: int64

SpiderMan : I see that we are getting answers to our questions fast. There are 10 categories; the most frequent is '5' with a count of 397 out of 1460.

In [12]:
column = OverallQual
print('Column Name:{}\nCardinality:{}\nValues:{}'.format(column.name, column.nunique(), column.unique()))
Column Name:OverallQual
Cardinality:10
Values:[7, 6, 8, 5, 9, 4, 10, 3, 1, 2]
Categories (10, int64): [7, 6, 8, 5, ..., 10, 3, 1, 2]
In [13]:
OverallQual.value_counts()
Out[13]:
5     397
6     374
7     319
8     168
4     116
9      43
3      20
10     18
2       3
1       2
Name: OverallQual, dtype: int64
In [14]:
def getPlotsforCatFeature(series,figX=15,figY=7):
    f,ax=plt.subplots(1,2,figsize=(figX,figY))
    series.value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0])
    ax[0].set_title(f'{series.name}')
    ax[0].set_ylabel('')
    sns.countplot(series,ax=ax[1])
    ax[1].set_title(f'Count plot for {series.name}')
    plt.show()
    
getPlotsforCatFeature(OverallQual,15,5)

SpiderMan : Those are the value counts for each category. Category '5' ranks first with a count of 397, closely followed by '6' and '7' with counts of 374 and 319 respectively. The lowest counts come from categories '1' and '2', with just 5 observations even when combined. Great!...

I think I have an idea on univariate analysis now. But what about the relationships between features?

Iron Man : That's what we look for. Let's analyse feature pairs at once, looking for relationships. It's called Bivariate Analysis...


Bivariate Analysis ^

Relationship of a numerical feature with another numerical feature ^

Most often we feel curious about how two numerical features behave with respect to each other. The following techniques help us develop insights into those hidden relationships.

  • Scatterplot

For demonstration let's use 'GrLivArea' (above-ground living area) and 'SalePrice' from the housing dataset.

In [15]:
#scatter plot
house_data.plot.scatter(x='GrLivArea', y='SalePrice');

''' Alternatively you could use following function 
def scatterplot(seriesX,seriesY):
    data = pd.concat([seriesY, seriesX], axis=1)
    data.plot.scatter(x=seriesX.name, y=seriesY.name)
    
scatterplot(house_data['GrLivArea'],house_data['SalePrice'])
'''
Out[15]:
" Alternatively you could use following function \ndef scatterplot(seriesX,seriesY):\n    data = pd.concat([seriesY, seriesX], axis=1)\n    data.plot.scatter(x=seriesX.name, y=seriesY.name)\n    \nscatterplot(house_data['GrLivArea'],house_data['SalePrice'])\n"

SpiderMan : It seems that 'SalePrice' and 'GrLivArea' are good friends, with a linear relationship.
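
A scatter plot only suggests the relationship; `Series.corr` puts a number on it (Pearson's r by default, in [-1, 1]). A sketch on made-up area/price pairs, since the real columns live inside the kernel:

```python
import pandas as pd

# Made-up, roughly linear area/price pairs (price in thousands)
area  = pd.Series([850, 1200, 1500, 1700, 2100, 2600])
price = pd.Series([120, 170, 205, 230, 290, 350])

r = area.corr(price)  # Pearson correlation coefficient by default
print(f'Pearson r = {r:.3f}')
```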

Iron Man : Nice, you are getting it. Next let's see how to find relationships between a numerical feature and a categorical feature.

Relationship of a numerical feature with a categorical feature ^

Let's try to visualize the relationship between a numerical feature and a categorical feature. Let's use SalePrice as the numerical feature and OverallQual, which indicates overall material and finish quality, as the categorical feature from the housing dataset. I know what you are thinking... we expect to see sale price increase with overall quality. Let's see whether we can confirm that using the following technique,

  • Box Plot
In [16]:
#Box plot
num = 'SalePrice'
cat = 'OverallQual'
df  =  house_data

data = pd.concat([df[num], df[cat]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=cat, y=num, data=data)
fig.axis(ymin=0, ymax=800000);

SpiderMan : How beautiful! Just as we expected! SalePrice increases with OverallQual (overall material and finish quality). Shall we do this to analyse the relationship with SalePrice for a few more categorical columns?

In [17]:
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    plt.xticks(rotation=90)

def fillMissingCatColumns(data,categorical):
    for c in categorical:
        data[c] = data[c].astype('category')
        if data[c].isnull().any():
            data[c] = data[c].cat.add_categories(['MISSING'])
            data[c] = data[c].fillna('MISSING')
    
def getboxPlots(data,var,categorical):
    fillMissingCatColumns(data,categorical)
    f = pd.melt(data, id_vars=var, value_vars=categorical)
    g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
    g = g.map(boxplot, "value", var)
    

data = house_data.copy()
categorical = [f for f in data.columns if data.dtypes[f] == 'object']    
getboxPlots(data,'SalePrice',categorical)

Iron Man : That's what you asked for... Going through each one to identify relationships is your homework. :D Guess what's next...

SpiderMan : mm... Correlation Analysis??

Iron Man : Nope! It's tea time!...

With some cake!


Correlation Analysis^

Correlation Heat Map^

Large correlations between features are among the best indicators used for feature selection. If we have a dataset with many columns, a good way to quickly check correlations among columns is to visualize the correlation matrix as a heatmap.

In [18]:
def getCorrHeatMap(dataFrame,figSize=[12,9]):
    corrmat = dataFrame.corr()
    f, ax = plt.subplots(figsize=(figSize[0], figSize[1]))
    sns.heatmap(corrmat, vmax=.8, square=True);

getCorrHeatMap(house_data)

Iron Man : When a cell is lighter (closer to white) it indicates a larger positive correlation, whereas darker cells indicate a larger negative correlation.
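
The same "look for the light cells" idea can also be done programmatically, which is handy when the matrix is too big to eyeball. A sketch with a made-up helper (`strong_pairs` is not part of this kernel) demonstrated on a toy frame rather than house_data:

```python
import pandas as pd

# Toy frame: 'b' is exactly 2*'a'; 'c' is only loosely (negatively) related
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

def strong_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    return [(i, j, round(corr.loc[i, j], 2))
            for i in corr.columns for j in corr.columns
            if i < j and corr.loc[i, j] > threshold]

print(strong_pairs(df))
```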

Zoomed Heat Map^

Iron Man : We are more interested in the larger correlations, so we can filter the columns and get a heatmap showing only the features most strongly correlated with 'SalePrice'.

In [19]:
def getZoomedCorrHeatMap(dataFrame,featureCount,target,figSize=[12,9]):
    corrmat = dataFrame.corr()
    cols = corrmat.nlargest(featureCount, target)[target].index
    f , ax = plt.subplots(figsize = (figSize[0],figSize[1]))
    cm = np.corrcoef(dataFrame[cols].values.T)
    sns.set(font_scale=1.25)
    hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
    plt.show()

getZoomedCorrHeatMap(house_data,10,'SalePrice',[10,8])

The two features with the highest correlations are OverallQual (overall material and finish quality) at 0.79 and GrLivArea at 0.71.


SpiderMan : Wow, great! Is that it? Any more tips?

Iron Man : Yes! Go through the rest of the kernel... I have noted down some tips for you.


Investigation of missing values^

What's missing? to what extent?^

In [20]:
def getMissingValuesInfo(df):
    total = df.isnull().sum().sort_values(ascending = False)
    percent = round(df.isnull().sum().sort_values(ascending = False)/len(df)*100, 2)
    temp = pd.concat([total, percent], axis = 1,keys= ['Total Missing Count', '% of Total Observations'])
    temp.index.name ='Feature Name'
    return temp.loc[(temp['Total Missing Count'] > 0)]

getMissingValuesInfo(house_data)
Out[20]:
Total Missing Count % of Total Observations
Feature Name
PoolQC 1453 99.52
MiscFeature 1406 96.30
Alley 1369 93.77
Fence 1179 80.75
FireplaceQu 690 47.26
LotFrontage 259 17.74
GarageCond 81 5.55
GarageType 81 5.55
GarageYrBlt 81 5.55
GarageFinish 81 5.55
GarageQual 81 5.55
BsmtExposure 38 2.60
BsmtFinType2 38 2.60
BsmtFinType1 37 2.53
BsmtCond 37 2.53
BsmtQual 37 2.53
MasVnrArea 8 0.55
MasVnrType 8 0.55
Electrical 1 0.07

Visualizing missing values in a dataframe^

In [21]:
# Visualizing missing counts
missing = house_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
plt.subplots(figsize=(15,5))
missing.plot.bar()
plt.show()
In [22]:
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(house_data.isnull(), cbar=False, cmap="YlGnBu_r")
plt.show()

White spaces show the missing values in the data frame.
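
Since `DataFrame.isnull()` is just a boolean frame, the same information can be summarised without any plotting at all. A pandas-only sketch on a made-up frame standing in for house_data:

```python
import numpy as np
import pandas as pd

# Made-up frame with holes in it
df = pd.DataFrame({
    'PoolQC':  [np.nan, np.nan, np.nan, 'Gd'],
    'Fence':   [np.nan, 'MnPrv', np.nan, 'GdPrv'],
    'LotArea': [8450, 9600, 11250, 9550],
})

mask = df.isnull()             # True wherever a value is missing
pct = mask.mean() * 100        # percentage missing per column
print(pct.sort_values(ascending=False))

complete_rows = (~mask).all(axis=1)  # rows with no missing values at all
print('complete rows:', int(complete_rows.sum()))
```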


Distribution plots for list of numerical features^

In [23]:
def distplots(data,num_features):
    f = pd.melt(data, value_vars=num_features)
    g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
    g = g.map(sns.distplot, "value")
    

num_features = house_data.select_dtypes(include=['int64','float64']).columns
distplots(house_data,num_features)

Describing Categorical and Numerical features separately^

In [24]:
num_features = house_data.select_dtypes(include=['int64','float64'])
num_features.describe()
Out[24]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

In [25]:
categorical_features = house_data.select_dtypes(include='object')
categorical_features.describe()
Out[25]:
MSZoning Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 ... GarageType GarageFinish GarageQual GarageCond PavedDrive PoolQC Fence MiscFeature SaleType SaleCondition
count 1460 1460 91 1460 1460 1460 1460 1460 1460 1460 ... 1379 1379 1379 1379 1460 7 281 54 1460 1460
unique 5 2 2 4 4 2 5 3 25 9 ... 6 3 5 5 3 3 4 4 9 6
top RL Pave Grvl Reg Lvl AllPub Inside Gtl NAmes Norm ... Attchd Unf TA TA Y Gd MnPrv Shed WD Normal
freq 1151 1454 50 925 1311 1459 1052 1382 225 1260 ... 870 605 1311 1326 1340 3 157 49 1267 1198

4 rows × 43 columns


Listing unique values in categorical columns^

In [26]:
def printUniqueValues(df,cardinality=1000):
    n = df.select_dtypes(include=object)
    for column in n.columns:
        uCount = df[column].nunique()
        if uCount<=cardinality:
            print('{:>12}: {} {}'.format(column,uCount, df[column].unique()))
            #print(column,': [',uCount , '] ', df[column].unique())


printUniqueValues(house_data,10)
    MSZoning: 5 ['RL' 'RM' 'C (all)' 'FV' 'RH']
      Street: 2 ['Pave' 'Grvl']
       Alley: 2 [nan 'Grvl' 'Pave']
    LotShape: 4 ['Reg' 'IR1' 'IR2' 'IR3']
 LandContour: 4 ['Lvl' 'Bnk' 'Low' 'HLS']
   Utilities: 2 ['AllPub' 'NoSeWa']
   LotConfig: 5 ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
   LandSlope: 3 ['Gtl' 'Mod' 'Sev']
  Condition1: 9 ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
  Condition2: 8 ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
    BldgType: 5 ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
  HouseStyle: 8 ['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']
   RoofStyle: 6 ['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']
    RoofMatl: 8 ['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll'
 'ClyTile']
  MasVnrType: 4 ['BrkFace' 'None' 'Stone' 'BrkCmn' nan]
   ExterQual: 4 ['Gd' 'TA' 'Ex' 'Fa']
   ExterCond: 5 ['TA' 'Gd' 'Fa' 'Po' 'Ex']
  Foundation: 6 ['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone']
    BsmtQual: 4 ['Gd' 'TA' 'Ex' nan 'Fa']
    BsmtCond: 4 ['TA' 'Gd' nan 'Fa' 'Po']
BsmtExposure: 4 ['No' 'Gd' 'Mn' 'Av' nan]
BsmtFinType1: 6 ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' nan 'LwQ']
BsmtFinType2: 6 ['Unf' 'BLQ' nan 'ALQ' 'Rec' 'LwQ' 'GLQ']
     Heating: 6 ['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor']
   HeatingQC: 5 ['Ex' 'Gd' 'TA' 'Fa' 'Po']
  CentralAir: 2 ['Y' 'N']
  Electrical: 5 ['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix' nan]
 KitchenQual: 4 ['Gd' 'TA' 'Ex' 'Fa']
  Functional: 7 ['Typ' 'Min1' 'Maj1' 'Min2' 'Mod' 'Maj2' 'Sev']
 FireplaceQu: 5 [nan 'TA' 'Gd' 'Fa' 'Ex' 'Po']
  GarageType: 6 ['Attchd' 'Detchd' 'BuiltIn' 'CarPort' nan 'Basment' '2Types']
GarageFinish: 3 ['RFn' 'Unf' 'Fin' nan]
  GarageQual: 5 ['TA' 'Fa' 'Gd' nan 'Ex' 'Po']
  GarageCond: 5 ['TA' 'Fa' nan 'Gd' 'Po' 'Ex']
  PavedDrive: 3 ['Y' 'N' 'P']
      PoolQC: 3 [nan 'Ex' 'Fa' 'Gd']
       Fence: 4 [nan 'MnPrv' 'GdWo' 'GdPrv' 'MnWw']
 MiscFeature: 4 [nan 'Shed' 'Gar2' 'Othr' 'TenC']
    SaleType: 9 ['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth']
SaleCondition: 6 ['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family']

Get to know your dataset using Pandas Profiling^

Thanks to Firath's kernel : https://www.kaggle.com/frtgnn/thorough-eda-with-a-single-line-pandas-profiling/

Pandas Profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

In [27]:
import pandas_profiling
profile_report = pandas_profiling.ProfileReport(titanic_data)
#profile_report.to_file("profile_report.html")
profile_report
Out[27]:

In [28]:
# We can use pandas profiling on selected features too.

# Using Pandas Profiling to analyse SalePrice feature in housing dataset.
import pandas_profiling
series = house_data['SalePrice']
d = { series.name : series}
df = pd.DataFrame(d) 
pandas_profiling.ProfileReport(df)
Out[28]:

Thank You! Let's Party!

If you found this useful I would be really glad if you could show your appreciation with an upvote! See you!